DDPG (Deep Deterministic Policy Gradient) — from scratch in PyTorch#
DDPG is an off-policy actor–critic algorithm for continuous control.
This notebook implements DDPG at a low level in PyTorch:
replay buffer
actor + critic networks
target networks with soft updates
exploration noise for deterministic policies
a clean training loop + Plotly diagnostics
We’ll train on a simple continuous environment (default: Pendulum-v1) and plot:
score per episode (learning curve)
Q-values / TD targets during learning
policy evolution on fixed probe states
Learning goals#
By the end you should be able to:
explain the actor–critic factorization in DDPG and what each network learns
write the critic target with target networks precisely
understand why DDPG needs (1) experience replay and (2) target networks
implement DDPG updates in low-level PyTorch (no RL libraries)
interpret common diagnostics: returns, losses, Q-values, and policy drift
Prerequisites#
basic PyTorch (nn.Module, optimizers)
Bellman equation / TD learning intuition
continuous action spaces (Box)
1) DDPG structure (actor–critic) and target networks#
Actor (deterministic policy)#
The actor is a deterministic policy network:
\[
a = \mu_\theta(s).
\]
In practice we output a tanh-bounded action and then scale to match the environment’s action bounds.
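As a concrete, illustrative example of that rescaling (assuming Pendulum-v1's torque bounds of low = -2, high = +2):

import numpy as np

low, high = -2.0, 2.0          # assumed Pendulum-v1 torque bounds
scale = (high - low) / 2.0     # 2.0
bias = (high + low) / 2.0      # 0.0
raw = np.tanh(0.7)             # network output, always in (-1, 1)
action = scale * raw + bias    # rescaled into [low, high]
print(action)                  # ~1.21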
Critic (action-value function)#
The critic estimates the Q-value for a state–action pair:
\[
Q_\phi(s, a) \approx Q^{\mu}(s, a) = \mathbb{E}\Big[\textstyle\sum_{t\ge 0} \gamma^{t} r_t \,\Big|\, s_0 = s,\ a_0 = a\Big].
\]
Target networks (the stabilizer)#
Bootstrapping makes the target depend on the current function approximators. To reduce moving-target instability, DDPG maintains slowly-updated copies:
target actor: \(\mu_{\theta'}\)
target critic: \(Q_{\phi'}\)
Soft-update them after each gradient step:
\[
\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta', \qquad \phi' \leftarrow \tau\,\phi + (1-\tau)\,\phi',
\]
with \(\tau \ll 1\) (here \(\tau = 0.005\)).
Critic target (precise)#
For a transition \((s,a,r,s',d)\) sampled from replay (where \(d\in\{0,1\}\) indicates terminal), the TD target is
\[
y = r + \gamma\,(1-d)\,Q_{\phi'}\big(s',\, \mu_{\theta'}(s')\big),
\]
and we fit the critic via the mean-squared TD error
\[
L(\phi) = \mathbb{E}_{(s,a,r,s',d)\sim\mathcal{D}}\Big[\big(Q_\phi(s,a) - y\big)^2\Big].
\]
Actor objective (deterministic policy gradient)#
The actor is trained to maximize the critic’s value under its actions:
\[
J(\theta) = \mathbb{E}_{s\sim\mathcal{D}}\big[Q_\phi\big(s, \mu_\theta(s)\big)\big].
\]
In code we minimize the actor loss
\[
L(\theta) = -\,\mathbb{E}_{s\sim\mathcal{D}}\big[Q_\phi\big(s, \mu_\theta(s)\big)\big].
\]
The gradient is the deterministic policy gradient:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{s\sim\mathcal{D}}\big[\nabla_a Q_\phi(s,a)\big|_{a=\mu_\theta(s)}\;\nabla_\theta \mu_\theta(s)\big].
\]
PyTorch computes this automatically when we backprop through Q(s, actor(s)).
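To see this concretely, here is a minimal, self-contained sketch with toy linear networks (illustrative only, not the notebook's actual models) checking that backprop through Q(s, actor(s)) reproduces the chain-rule form \(\nabla_a Q \cdot \nabla_\theta \mu\):

import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, act_dim = 3, 1
actor = nn.Linear(obs_dim, act_dim, bias=False)        # mu_theta(s) = W s
critic = nn.Linear(obs_dim + act_dim, 1, bias=False)   # Q_phi(s, a) = V [s; a]

s = torch.randn(5, obs_dim)
q = critic(torch.cat([s, actor(s)], dim=-1)).mean()
q.backward()                                            # backprop through Q(s, mu(s))

# Chain rule by hand: dQ/dW = E[(dQ/da) s^T]; for this linear critic,
# dQ/da is just the critic weight attached to the action input.
dq_da = critic.weight[0, obs_dim:]                      # shape (act_dim,)
manual = dq_da[:, None] * s.mean(dim=0)[None, :]        # shape (act_dim, obs_dim)
print(torch.allclose(actor.weight.grad, manual, atol=1e-6))  # True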
2) Algorithm sketch (pseudocode)#
Initialize actor \(\mu_\theta\), critic \(Q_\phi\)
Initialize target networks \(\mu_{\theta'}\leftarrow\mu_\theta\), \(Q_{\phi'}\leftarrow Q_\phi\)
Initialize replay buffer \(\mathcal{D}\)
For each environment step:
act with exploration: \(a=\mu_\theta(s)+\epsilon\)
store \((s,a,r,s',d)\) in \(\mathcal{D}\)
sample minibatch from \(\mathcal{D}\)
critic: regress \(Q_\phi(s,a)\) to \(y=r+\gamma(1-d)Q_{\phi'}(s',\mu_{\theta'}(s'))\)
actor: ascend \(\nabla_\theta Q_\phi(s,\mu_\theta(s))\)
soft update targets: \((\theta',\phi')\leftarrow \tau(\theta,\phi)+(1-\tau)(\theta',\phi')\)
import math
import platform
import time
from dataclasses import dataclass
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly
import os
import plotly.io as pio
try:
import torch
import torch.nn as nn
import torch.nn.functional as F
TORCH_AVAILABLE = True
except Exception as e:
TORCH_AVAILABLE = False
_TORCH_IMPORT_ERROR = e
# Gymnasium first; fall back to gym
try:
import gymnasium as gym
GYM_BACKEND = 'gymnasium'
except Exception:
import gym
GYM_BACKEND = 'gym'
pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
print('Python', platform.python_version())
print('NumPy', np.__version__)
print('Pandas', pd.__version__)
print('Plotly', plotly.__version__)
print('Gym backend', GYM_BACKEND, 'version', gym.__version__)
print('Torch', torch.__version__ if TORCH_AVAILABLE else _TORCH_IMPORT_ERROR)
Python 3.12.9
NumPy 1.26.2
Pandas 2.1.3
Plotly 6.5.2
Gym backend gymnasium version 1.1.1
Torch 2.7.0+cu126
# --- Run configuration ---
FAST_RUN = True # set False for longer training
ENV_ID = 'Pendulum-v1'
SEED = 42
NUM_EPISODES = 40 if FAST_RUN else 250
MAX_STEPS_PER_EPISODE = None # None means use env default
REPLAY_SIZE = 200_000
BATCH_SIZE = 128
GAMMA = 0.99
TAU = 0.005
ACTOR_LR = 1e-3
CRITIC_LR = 1e-3
START_STEPS = 2_000 # random actions before using the actor + noise
UPDATE_AFTER = 1_000 # start gradient updates after this many steps
UPDATES_PER_STEP = 1
NOISE_SIGMA = 0.1 # exploration noise std (in action units after scaling)
HIDDEN_SIZES = (256, 256)
GRAD_CLIP_NORM = 1.0
PROBE_N = 32
PROBE_EVERY_EPISODES = 5
DEVICE = 'cuda' if TORCH_AVAILABLE and torch.cuda.is_available() else 'cpu'
print('DEVICE:', DEVICE)
DEVICE: cpu
/home/tempa/miniconda3/lib/python3.12/site-packages/torch/cuda/__init__.py:174: UserWarning:
CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
def set_global_seeds(seed: int):
np.random.seed(seed)
if TORCH_AVAILABLE:
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
def env_reset(env, seed: int | None = None):
out = env.reset(seed=seed) if seed is not None else env.reset()
if isinstance(out, tuple):
obs, info = out
else:
obs, info = out, {}
return obs, info
def env_step(env, action):
out = env.step(action)
if len(out) == 5:
next_obs, reward, terminated, truncated, info = out
done = bool(terminated or truncated)
else:
next_obs, reward, done, info = out
done = bool(done)
return next_obs, float(reward), done, info
def make_env(env_id: str, seed: int):
env = gym.make(env_id)
_ = env_reset(env, seed=seed)
try:
env.action_space.seed(seed)
env.observation_space.seed(seed)
except Exception:
pass
return env
def action_scale_and_bias(action_space):
# Works for gymnasium.spaces.Box and gym.spaces.Box
high = np.asarray(action_space.high, dtype=np.float32)
low = np.asarray(action_space.low, dtype=np.float32)
scale = (high - low) / 2.0
bias = (high + low) / 2.0
return scale, bias
class ReplayBuffer:
def __init__(self, obs_dim: int, act_dim: int, size: int, seed: int):
self.obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
self.next_obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
self.act_buf = np.zeros((size, act_dim), dtype=np.float32)
self.rew_buf = np.zeros((size, 1), dtype=np.float32)
self.done_buf = np.zeros((size, 1), dtype=np.float32)
self.max_size = int(size)
self.ptr = 0
self.size = 0
self.rng = np.random.default_rng(seed)
def add(self, obs, act, rew: float, next_obs, done: bool):
self.obs_buf[self.ptr] = obs
self.act_buf[self.ptr] = act
self.rew_buf[self.ptr] = rew
self.next_obs_buf[self.ptr] = next_obs
self.done_buf[self.ptr] = float(done)
self.ptr = (self.ptr + 1) % self.max_size
self.size = min(self.size + 1, self.max_size)
def sample(self, batch_size: int):
idx = self.rng.integers(0, self.size, size=batch_size)
batch = dict(
obs=self.obs_buf[idx],
act=self.act_buf[idx],
rew=self.rew_buf[idx],
next_obs=self.next_obs_buf[idx],
done=self.done_buf[idx],
)
return batch
def mlp(sizes, activation=nn.ReLU, output_activation=nn.Identity):
layers = []
for i in range(len(sizes) - 1):
act = activation if i < len(sizes) - 2 else output_activation
layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
return nn.Sequential(*layers)
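# For example, mlp([3, 256, 256, 1]) builds: Linear(3,256) -> ReLU -> Linear(256,256) -> ReLU -> Linear(256,1).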
class Actor(nn.Module):
def __init__(self, obs_dim: int, act_dim: int, hidden_sizes, action_scale, action_bias):
super().__init__()
self.net = mlp([obs_dim, *hidden_sizes, act_dim], activation=nn.ReLU, output_activation=nn.Tanh)
self.register_buffer('action_scale', torch.as_tensor(action_scale, dtype=torch.float32))
self.register_buffer('action_bias', torch.as_tensor(action_bias, dtype=torch.float32))
def forward(self, obs):
a = self.net(obs)
return self.action_scale * a + self.action_bias
class Critic(nn.Module):
def __init__(self, obs_dim: int, act_dim: int, hidden_sizes):
super().__init__()
self.net = mlp([obs_dim + act_dim, *hidden_sizes, 1], activation=nn.ReLU, output_activation=nn.Identity)
def forward(self, obs, act):
x = torch.cat([obs, act], dim=-1)
return self.net(x)
@dataclass
class DDPGConfig:
gamma: float = GAMMA
tau: float = TAU
actor_lr: float = ACTOR_LR
critic_lr: float = CRITIC_LR
batch_size: int = BATCH_SIZE
grad_clip_norm: float | None = GRAD_CLIP_NORM
class DDPGAgent:
def __init__(self, obs_dim: int, act_dim: int, action_scale, action_bias, hidden_sizes, device: str, cfg: DDPGConfig):
self.device = torch.device(device)
self.cfg = cfg
self.actor = Actor(obs_dim, act_dim, hidden_sizes, action_scale, action_bias).to(self.device)
self.critic = Critic(obs_dim, act_dim, hidden_sizes).to(self.device)
# Target networks start as exact copies
self.target_actor = Actor(obs_dim, act_dim, hidden_sizes, action_scale, action_bias).to(self.device)
self.target_critic = Critic(obs_dim, act_dim, hidden_sizes).to(self.device)
self.target_actor.load_state_dict(self.actor.state_dict())
self.target_critic.load_state_dict(self.critic.state_dict())
self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=cfg.actor_lr)
self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=cfg.critic_lr)
@torch.no_grad()
def act(self, obs: np.ndarray, noise_sigma: float = 0.0):
obs_t = torch.as_tensor(obs, dtype=torch.float32, device=self.device).unsqueeze(0)
action = self.actor(obs_t).cpu().numpy().squeeze(0)
if noise_sigma > 0:
action = action + np.random.normal(0.0, noise_sigma, size=action.shape).astype(np.float32)
return action
def update(self, batch):
obs = torch.as_tensor(batch['obs'], dtype=torch.float32, device=self.device)
act = torch.as_tensor(batch['act'], dtype=torch.float32, device=self.device)
rew = torch.as_tensor(batch['rew'], dtype=torch.float32, device=self.device)
next_obs = torch.as_tensor(batch['next_obs'], dtype=torch.float32, device=self.device)
done = torch.as_tensor(batch['done'], dtype=torch.float32, device=self.device)
# --- Critic update ---
with torch.no_grad():
next_act = self.target_actor(next_obs)
target_q_next = self.target_critic(next_obs, next_act)
y = rew + self.cfg.gamma * (1.0 - done) * target_q_next
q = self.critic(obs, act)
critic_loss = F.mse_loss(q, y)
self.critic_opt.zero_grad(set_to_none=True)
critic_loss.backward()
if self.cfg.grad_clip_norm is not None:
torch.nn.utils.clip_grad_norm_(self.critic.parameters(), self.cfg.grad_clip_norm)
self.critic_opt.step()
# --- Actor update ---
self.actor_opt.zero_grad(set_to_none=True)
actor_actions = self.actor(obs)
actor_loss = -self.critic(obs, actor_actions).mean()
actor_loss.backward()
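# Note: this backward pass also accumulates gradients in the critic's parameters,
# but only actor_opt.step() runs below; those stale critic grads are cleared by
# critic_opt.zero_grad(set_to_none=True) at the start of the next update.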
if self.cfg.grad_clip_norm is not None:
torch.nn.utils.clip_grad_norm_(self.actor.parameters(), self.cfg.grad_clip_norm)
self.actor_opt.step()
# --- Soft update target networks ---
with torch.no_grad():
for p, p_targ in zip(self.actor.parameters(), self.target_actor.parameters()):
p_targ.data.mul_(1.0 - self.cfg.tau)
p_targ.data.add_(self.cfg.tau * p.data)
for p, p_targ in zip(self.critic.parameters(), self.target_critic.parameters()):
p_targ.data.mul_(1.0 - self.cfg.tau)
p_targ.data.add_(self.cfg.tau * p.data)
metrics = {
'critic_loss': float(critic_loss.item()),
'actor_loss': float(actor_loss.item()),
'q_mean': float(q.detach().mean().item()),
'y_mean': float(y.detach().mean().item()),
}
return metrics
def moving_average(x, window: int):
x = np.asarray(x, dtype=np.float64)
if len(x) < window:
return x
kernel = np.ones(window) / window
return np.convolve(x, kernel, mode='valid')
def train_ddpg(env_id: str, seed: int):
set_global_seeds(seed)
env = make_env(env_id, seed=seed)
obs_dim = int(np.prod(env.observation_space.shape))
act_dim = int(np.prod(env.action_space.shape))
act_scale, act_bias = action_scale_and_bias(env.action_space)
buf = ReplayBuffer(obs_dim, act_dim, size=REPLAY_SIZE, seed=seed)
agent = DDPGAgent(
obs_dim=obs_dim,
act_dim=act_dim,
action_scale=act_scale,
action_bias=act_bias,
hidden_sizes=HIDDEN_SIZES,
device=DEVICE,
cfg=DDPGConfig(),
)
max_steps = MAX_STEPS_PER_EPISODE or getattr(env, '_max_episode_steps', 200)
logs = {
'episode': [],
'episode_return': [],
'episode_length': [],
'global_step_end': [],
# per-update metrics
'update_step': [],
'actor_loss': [],
'critic_loss': [],
'q_mean': [],
'y_mean': [],
# probe snapshots
'probe_episode': [],
'probe_action_stat': [],
'probe_q': [],
}
probe_states = None
global_step = 0
update_step = 0
t0 = time.time()
for ep in range(1, NUM_EPISODES + 1):
obs, _ = env_reset(env, seed=seed + ep)
obs = np.asarray(obs, dtype=np.float32).reshape(-1)
ep_return = 0.0
ep_len = 0
for _ in range(max_steps):
if global_step < START_STEPS:
action = env.action_space.sample()
else:
action = agent.act(obs, noise_sigma=NOISE_SIGMA)
# clip to action bounds
action = np.clip(action, env.action_space.low, env.action_space.high).astype(np.float32)
next_obs, reward, done, _ = env_step(env, action)
next_obs = np.asarray(next_obs, dtype=np.float32).reshape(-1)
buf.add(obs, action, reward, next_obs, done)
obs = next_obs
ep_return += reward
ep_len += 1
global_step += 1
# gradient updates
if global_step >= UPDATE_AFTER and buf.size >= BATCH_SIZE:
for _u in range(UPDATES_PER_STEP):
batch = buf.sample(BATCH_SIZE)
metrics = agent.update(batch)
logs['update_step'].append(update_step)
logs['actor_loss'].append(metrics['actor_loss'])
logs['critic_loss'].append(metrics['critic_loss'])
logs['q_mean'].append(metrics['q_mean'])
logs['y_mean'].append(metrics['y_mean'])
update_step += 1
if done:
break
logs['episode'].append(ep)
logs['episode_return'].append(ep_return)
logs['episode_length'].append(ep_len)
logs['global_step_end'].append(global_step)
# Fix a set of probe states once replay has enough data
if probe_states is None and buf.size >= max(PROBE_N, BATCH_SIZE):
probe_states = buf.sample(PROBE_N)['obs']
# Snapshot policy + Q on probe states to visualize policy evolution
if probe_states is not None and (ep % PROBE_EVERY_EPISODES == 0 or ep == NUM_EPISODES):
with torch.no_grad():
ps = torch.as_tensor(probe_states, dtype=torch.float32, device=agent.device)
pa = agent.actor(ps).cpu().numpy()
pq = agent.critic(ps, agent.actor(ps)).cpu().numpy().reshape(-1)
if pa.shape[1] == 1:
policy_stat = pa[:, 0] # 1D actions
else:
policy_stat = np.linalg.norm(pa, axis=1) # multi-dim summary
logs['probe_episode'].append(ep)
logs['probe_action_stat'].append(policy_stat)
logs['probe_q'].append(pq)
if ep % 10 == 0 or ep == 1 or ep == NUM_EPISODES:
elapsed = time.time() - t0
print(f'Episode {ep:4d} | return {ep_return:8.1f} | len {ep_len:3d} | steps {global_step:6d} | elapsed {elapsed:6.1f}s')
env.close()
return logs
logs = train_ddpg(ENV_ID, seed=SEED)
print('Done. Episodes:', len(logs['episode']), 'Updates:', len(logs['update_step']))
Episode 1 | return -865.7 | len 200 | steps 200 | elapsed 0.0s
Episode 10 | return -836.3 | len 200 | steps 2000 | elapsed 11.1s
Episode 20 | return -257.9 | len 200 | steps 4000 | elapsed 40.1s
Episode 30 | return -127.4 | len 200 | steps 6000 | elapsed 71.2s
Episode 40 | return -381.6 | len 200 | steps 8000 | elapsed 102.5s
Done. Episodes: 40 Updates: 7001
3) Plotly diagnostics#
DDPG can look like it’s learning while the critic is quietly diverging, so we’ll monitor:
episode return (score)
critic loss and actor loss
Q-values vs TD targets (sanity check)
policy evolution on fixed probe states (is the policy drifting smoothly?)
# --- Learning curve: score per episode ---
df_ep = pd.DataFrame({
'episode': logs['episode'],
'return': logs['episode_return'],
'length': logs['episode_length'],
})
ma_window = 10
ma = moving_average(df_ep['return'].values, window=ma_window)
ma_x = df_ep['episode'].values[ma_window - 1:]
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_ep['episode'], y=df_ep['return'], mode='lines', name='Return'))
if len(ma) == len(ma_x):
fig.add_trace(go.Scatter(x=ma_x, y=ma, mode='lines', name=f'Return (MA {ma_window})'))
fig.update_layout(title='DDPG learning curve (score per episode)', xaxis_title='Episode', yaxis_title='Return')
fig.show()
# --- Q-values, TD targets, and losses over update steps ---
df_up = pd.DataFrame({
'update_step': logs['update_step'],
'critic_loss': logs['critic_loss'],
'actor_loss': logs['actor_loss'],
'q_mean': logs['q_mean'],
'y_mean': logs['y_mean'],
})
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['q_mean'], mode='lines', name='Q(s,a) mean'))
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['y_mean'], mode='lines', name='TD target y mean'))
fig.update_layout(title='Critic outputs vs TD targets (mean over minibatch)', xaxis_title='Update step', yaxis_title='Value')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['critic_loss'], mode='lines', name='Critic loss (MSE)'))
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['actor_loss'], mode='lines', name='Actor loss (-Q)'))
fig.update_layout(title='Actor/Critic losses', xaxis_title='Update step', yaxis_title='Loss')
fig.show()
# --- Policy evolution on fixed probe states ---
if len(logs['probe_episode']) > 0:
probe_eps = logs['probe_episode']
z_action = np.stack(logs['probe_action_stat'], axis=1) # (PROBE_N, T)
z_q = np.stack(logs['probe_q'], axis=1) # (PROBE_N, T)
fig = go.Figure(data=go.Heatmap(
z=z_action,
x=probe_eps,
y=list(range(z_action.shape[0])),
colorscale='RdBu',
zmid=0.0,
colorbar=dict(title='action (1D) or ||a||'),
))
fig.update_layout(title='Policy evolution on fixed probe states', xaxis_title='Episode snapshot', yaxis_title='Probe state index')
fig.show()
fig = go.Figure(data=go.Heatmap(
z=z_q,
x=probe_eps,
y=list(range(z_q.shape[0])),
colorscale='Viridis',
colorbar=dict(title='Q(s, mu(s))'),
))
fig.update_layout(title='Q-values on probe states (critic under current actor)', xaxis_title='Episode snapshot', yaxis_title='Probe state index')
fig.show()
else:
print('No probe snapshots recorded (try increasing NUM_EPISODES or reducing PROBE_EVERY_EPISODES).')
4) Stable-Baselines implementation (if you want a reference)#
If you want a battle-tested baseline, Stable-Baselines has DDPG implementations.
Notes:
stable-baselines3 (PyTorch) and stable-baselines (TensorFlow) are different packages.
This repository’s environment may not have them installed; the code below is for reference.
Stable-Baselines3 (PyTorch)#
# pip install stable-baselines3
import numpy as np
import gymnasium as gym
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise
env = gym.make('Pendulum-v1')
n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
model = DDPG('MlpPolicy', env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=200_000)
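After learn() returns, a quick greedy evaluation rollout looks like this (a sketch, not executed in this notebook):

# Greedy rollout with the trained SB3 policy.
obs, _ = env.reset()
total, done = 0.0, False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    total += reward
    done = terminated or truncated
print('episode return:', total)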
Stable-Baselines (TensorFlow; older/archived)#
# pip install stable-baselines
import numpy as np
import gym
from stable_baselines import DDPG
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.noise import NormalActionNoise
env = gym.make('Pendulum-v1')
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
model = DDPG(MlpPolicy, env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=200_000)
5) Pitfalls + diagnostics#
Exploration: deterministic policies need explicit noise; too little noise → no learning.
Q-value blow-up: if \(Q\) grows without bound, reduce learning rates, add gradient clipping, check reward scale.
Action scaling: always scale tanh outputs to environment bounds; otherwise the critic learns on invalid actions.
Replay warm-up: start updates only after enough diverse transitions exist.
Overestimation bias: DDPG can overestimate; TD3 addresses this with twin critics + target smoothing.
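For reference, a minimal sketch of the TD3-style target (illustrative only, not wired into the agent above; the act_limit of 2.0 assumes Pendulum-v1’s bounds):

import torch

def td3_target(rew, done, next_obs, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=2.0):
    # Clipped double-Q target with target policy smoothing (TD3).
    with torch.no_grad():
        next_act = target_actor(next_obs)
        noise = (torch.randn_like(next_act) * noise_std).clamp(-noise_clip, noise_clip)
        next_act = (next_act + noise).clamp(-act_limit, act_limit)
        q_next = torch.min(target_q1(next_obs, next_act),
                           target_q2(next_obs, next_act))
        return rew + gamma * (1.0 - done) * q_next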
6) Hyperparameters explained (the ones that matter)#
GAMMA (\(\gamma\))#
Discount factor in the TD target:
closer to 1 → longer-horizon credit assignment, but bootstrapping is harder
smaller → more myopic, often more stable
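A rough rule of thumb (heuristic, not an exact statement): discounting weights rewards roughly \(1/(1-\gamma)\) steps into the future.

for gamma in (0.9, 0.99, 0.999):
    print(f'gamma={gamma}: effective horizon ~ {1.0 / (1.0 - gamma):.0f} steps')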
TAU (\(\tau\))#
Soft-update rate for target networks:
smaller (e.g. 0.001) → targets change slowly (stable, but may learn slower)
larger (e.g. 0.02) → targets track faster (less bias, potentially less stable)
REPLAY_SIZE#
Maximum transitions stored.
too small → poor diversity, correlated samples
very large → more diversity but older data (off-policy mismatch) and more memory
BATCH_SIZE#
Minibatch size for gradient updates.
larger → smoother gradients, higher compute
smaller → noisier updates (can help exploration but can destabilize critic)
START_STEPS#
How long to act randomly before relying on the actor.
helps fill replay with diverse transitions
if too short, early actor updates overfit to narrow experience
UPDATE_AFTER#
Delay before starting gradient updates.
ensures the critic’s first targets aren’t based on tiny replay buffers
UPDATES_PER_STEP#
How many gradient updates to do per environment step.
1 is the standard simple choice
larger values increase sample reuse but can overfit to replay and amplify instability
NOISE_SIGMA#
Exploration noise standard deviation (added to the actor’s action).
too small → agent may not discover better actions
too large → behavior becomes too random; critic targets get noisy
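The original DDPG paper used temporally correlated Ornstein–Uhlenbeck noise instead of independent Gaussian noise. A minimal sketch of one common parameterization (not used in the run above; a starting point for the exercise in section 7):

import numpy as np

class OUNoise:
    # Ornstein-Uhlenbeck process: theta pulls the state back toward mu,
    # sigma scales the random kicks (theta/sigma values follow Lillicrap et al.).
    def __init__(self, act_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(act_dim, mu, dtype=np.float32)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = (self.x + dx).astype(np.float32)
        return self.x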
GRAD_CLIP_NORM#
Gradient norm clipping (optional).
helps prevent occasional exploding gradients in the critic
7) Exercises + references#
Exercises#
Replace Gaussian exploration with Ornstein–Uhlenbeck noise and compare learning.
Add LayerNorm to the actor/critic MLPs; does it stabilize training?
Implement TD3 changes (twin critics + target policy smoothing) and compare the Q-value diagnostics.
References#
Lillicrap et al., Continuous control with deep reinforcement learning (DDPG): https://arxiv.org/abs/1509.02971
OpenAI Spinning Up (DDPG explanation + tips): https://spinningup.openai.com/en/latest/algorithms/ddpg.html
Stable-Baselines (archived TF implementations): https://github.com/hill-a/stable-baselines
Stable-Baselines3 docs: https://stable-baselines3.readthedocs.io/